Sorting Out the Document Identifier Assignment Problem

Author

  • Fabrizio Silvestri
Abstract

The compression of inverted file indexes in Web search engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this claim by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.
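The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy URLs, the helper names (`assign_ids`, `d_gaps`, `gamma_bits`), and the choice of Elias gamma coding as the cost measure are all assumptions made for illustration. The sketch assigns identifiers once in crawl order and once in lexicographic URL order, then compares the gaps of one posting list under each assignment.

```python
def assign_ids(urls):
    """Give each URL an identifier (starting at 1) in the order supplied."""
    return {url: i for i, url in enumerate(urls, start=1)}


def d_gaps(doc_ids):
    """Turn a set of identifiers into the gaps between sorted neighbours."""
    ids = sorted(doc_ids)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def gamma_bits(gaps):
    """Bits needed to write each gap (>= 1) in Elias gamma code."""
    return sum(2 * (g.bit_length() - 1) + 1 for g in gaps)


# Hypothetical toy collection: two sites whose pages were crawled interleaved.
crawl_order = [
    "http://a.example/news/1",
    "http://b.example/shop/1",
    "http://a.example/news/2",
    "http://b.example/shop/2",
    "http://a.example/news/3",
    "http://b.example/shop/3",
]

# Assume some term occurs only in the pages under /news/.
matching = [u for u in crawl_order if "/news/" in u]

crawl_ids = assign_ids(crawl_order)        # identifiers in crawl order
url_ids = assign_ids(sorted(crawl_order))  # identifiers in URL order

crawl_gaps = d_gaps(crawl_ids[u] for u in matching)
url_gaps = d_gaps(url_ids[u] for u in matching)

print(crawl_gaps, gamma_bits(crawl_gaps))  # [1, 2, 2] 7
print(url_gaps, gamma_bits(url_gaps))      # [1, 1, 1] 3
```

In this toy case, sorting by URL places pages of the same site next to each other, so a term concentrated on one site produces smaller gaps and a shorter encoded posting list. Real collections and other compression codes will behave differently in detail.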

Similar resources

Optimize Document Identifier Assignment for Inverted Index Compression

Document identifier assignment is a technique for inverted file index compression that works by reducing the d-gap values of posting lists. In existing studies it has been approached with either TSP or clustering methods. However, there is no proper formulation of the problem, and the existing approaches have no theoretical guarantee of being good approximations. In this paper, we first formulate document identifier assignmen...

An Improved Assignment Algorithm Based Rotational Angular Sorting Methods

The data assignment problem arises in multiple-target tracking applications and is crucial for overall performance. In this paper, two observations of data assignment are considered, and a new rotational sorting algorithm based on the maximum likelihood principle is presented. The given algorithm, of O(logN) and O(N) complexity, is faster than the more popularly used Hungarian ty...

A new multi-objective model for berth allocation and quay crane assignment problem with speed optimization and air emission considerations (A case study of Rajaee Port in Iran)

Over the past two decades, maritime transportation and container traffic worldwide have experienced rapid and continuous growth. With the increase in maritime transportation volume, greenhouse gas (GHG) emissions have become one of the new concerns for port managers. Port managers and government agencies for sustainable development of maritime transportation considered "green ports" t...

Primal and dual assignment networks

This paper presents two recurrent neural networks for solving the assignment problem. Simplifying the architecture of a recurrent neural network based on the primal assignment problem, the first recurrent neural network, called the primal assignment network, has less complex connectivity than its predecessor. The second recurrent neural network, called the dual assignment network, based on the ...

A Tree-Based inverted File for Fast Ranked-Document Retrieval

Inverted files are widely used to index documents in large-scale information retrieval systems. An inverted file consists of posting lists, which can be stored in either document-identifier ascending order or document-weight descending order. For an identifier-ascending-order posting list, retrieving ranked documents necessitates traversal of all postings, whereas for the weight-descending-o...

Publication date: 2007